Abstract

A new method for combining several initial estimators of the regression
function is introduced. Instead of building a linear or convex optimized
combination over a collection of basic estimators $r_1, \dots, r_M$, we
use them as a collective indicator of the distance between the training
data and a test observation. This local distance approach is model-free
and extremely fast. Most importantly, the resulting collective estimator
is shown to perform asymptotically at least as well in the $L^2$
sense as the best basic estimator in the collective. Moreover, it does
so without having to declare which might be the best basic estimator
for the given data set. A companion R package called COBRA (standing
for COmBined Regression Alternative) is presented (downloadable
on http://cran.r-project.org/web/packages/COBRA/index.html).
Numerical evidence is provided on both synthetic and real data sets
to assess the excellent performance of our method in a large variety of
prediction problems.

Index terms — Regression estimation, aggregation, nonlinearity, 
consistency, prediction. 
1 Introduction



Recent years have witnessed a growing interest in aggregated statistical
procedures, supported by considerable research and thorough empirical evidence.
Indeed, the increasing number of available estimation and prediction meth- 
ods (hereafter also denoted as machines) in a wide range of modern statistical 
problems naturally suggests using some efficient method for combining pro- 
cedures and estimators. If the combined strategy is known to be optimal in 
some sense and relatively free of assumptions that are hard to evaluate, then 
such a model-free strategy is a valuable research tool. 

In this regard, numerous contributions have enriched the aggregation litera- 
ture with various approaches, such as model selection aggregation (select the 
optimal single estimator from a list), convex aggregation (searching for the op- 
timal convex combination of given estimators, such as exponentially weighted 
aggregates) and linear aggregation (selecting the optimal linear combination). 

Model selection, linear-type aggregation strategies and related problems have 
been studied by Catoni (2004), Juditsky and Nemirovski (2000), Nemirovski 
(2000), Yang (2000, 2001, 2004), Györfi et al. (2002) and Wegkamp (2003).
Minimax results have been derived by Nemirovski (2000) and Tsybakov (2003), 
leading to the notion of optimal rates of aggregation. Similar results can be 
found in Bunea et al. (2007a). Further upper bounds for the risk in model 
selection and convex aggregation have been established for instance by
Audibert (2004), Birgé (2006) and Dalalyan and Tsybakov (2008). An interesting
feature is that such aggregation problems may be treated within the scope
of $L^1$-penalized least squares, as performed in Bunea et al. (2006, 2007a,b).
This kind of framework is also considered by van de Geer (2008) and
Koltchinskii (2009), with the $L^2$ loss replaced by another convex loss. More recently,
specific models such as single-index in Alquier and Biau (2013) and addi- 
tive models in Guedj and Alquier (2013) have been studied in the context of 
aggregation under a sparsity assumption. 

The present article investigates a distinctly different point of view, motivated 
by the sense that nonlinear, data-dependent techniques are a source of an- 
alytic flexibility and might improve over current aggregation procedures. In 
this regard, consider the following example classification problem: If the en- 
semble of machines happens to have a strong one, lurking but unnamed in the 
collection of which many might be very weak machines, it might make sense 
to consider a more sophisticated method than the previously cited methods 
for pooling the information across the machines. Thus, if one machine has 
an error rate of 5%, say, while most of the other machines have error rates 
close to 35%, then the ensemble approach might reduce the error rate to 25% 
or even 15%, but these are still significantly worse than the strong machine 
rate. Choosing to set aside some of the machines, on some data-dependent
criteria, seems only weakly motivated, since the performance of the collective,
retaining those suspect machines, might be quite good on a nearby data set. 
Similarly, searching for some phantom strong machine in the collective could 
also be ruinous when presented with new and different data. 

Instead of choosing either of these options — selecting out weak performers, 
searching for a hidden, universally strong performer — we propose an original 
nonlinear method for combining the outcomes over some list of plausibly good 
procedures. We call this combined method a regression collective over the 
given basic machines. More specifically, we consider the problem of building a 
new estimator by combining M estimators of the regression function, thereby 
exploiting an idea proposed in the context of classification by Mojirsheibani 
(1999). In words, given a set of preliminary estimators $r_1, \dots, r_M$, the idea
behind the resulting aggregation method is a "unanimity" concept, in that
it is based on the values predicted by $r_1, \dots, r_M$ for the data and for a new
observation $\mathbf{x}$. More specifically, a data point is considered to be "close" to
$\mathbf{x}$, and consequently reliable for contributing to the estimation of this new
observation, if all estimators predict values which are close to each other for
$\mathbf{x}$ and this data item, i.e., not more distant than a prespecified threshold $\varepsilon$.
The predicted value corresponding to this query point $\mathbf{x}$ is then set to the
average of the responses of the selected observations.

To make the concept clear, consider the following toy example illustrated
by Figure 1. Assume we are given the observations plotted as circles, and
the values predicted by two known machines $f_1$ and $f_2$ (triangles pointing
up and down, respectively). The goal is to predict the response for the new
point $x_0 = 0.5$. Set a threshold $\varepsilon = 0.2$. The black solid circles are the data
points $(x_i, y_i)$ within the two dotted intervals, i.e., such that for $m = 1, 2$,
$|f_m(x_i) - f_m(x_0)| \le \varepsilon$. Averaging the corresponding $y_i$'s yields the prediction
for $x_0$ (black star).
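
The selection-then-average step of this toy example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two machines `f1` and `f2`, the data values and the helper name `collective_predict` are all made up here and do not come from Figure 1.

```python
def collective_predict(machines, xs, ys, x0, eps):
    """Average the y_i whose machine outputs all lie within eps of those at x0."""
    selected = [
        y for x, y in zip(xs, ys)
        if all(abs(f(x) - f(x0)) <= eps for f in machines)
    ]
    # Convention 0/0 = 0: if no data point is selected, predict 0.
    return sum(selected) / len(selected) if selected else 0.0


# Two hypothetical machines standing in for f_1 and f_2.
def f1(x):
    return 2.0 * x


def f2(x):
    return 1.0 - x


# Made-up observations (x_i, y_i).
xs = [0.1, 0.42, 0.45, 0.55, 0.9]
ys = [0.5, 1.0, 1.5, 0.5, 2.0]

# Only the three middle points pass the test for BOTH machines at x0 = 0.5.
prediction = collective_predict([f1, f2], xs, ys, x0=0.5, eps=0.2)
```

Note that a point is retained because the machines' outputs agree with their outputs at $x_0$, not because $x_i$ itself is near $x_0$.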

Figure 1: A toy example.
(a) Data points. (b) The collective operates. (c) Predicted value for $x_0$.




We stress that the central and original idea behind our approach is that the 
resulting regression predictor is a nonlinear, data-dependent function of the 
basic predictors $r_1, \dots, r_M$. To the best of our knowledge, there exists no
formalized procedure in the machine learning and aggregation literature that
operates as does ours. However, we note that our approach has a conceptual
link with the framework described in van der Laan et al. (2007), where
several estimators are combined using a cross-validation scheme. Since their
strategy, called Super Learner (SL), is motivated by research concerns similar
to our own, it is reasonable to deploy SL as a benchmark in our study of
regression collectives.

Along with this paper, we release the software COBRA (Guedj, 2013) which 
implements the method as an additional package to the statistical software R 
(see R Core Team, 2012). COBRA is freely downloadable on the CRAN website
(see footnote 3). As detailed in Section 3, we undertook a lengthy series of numerical
experiments, over which COBRA proved extremely and surprisingly successful. 
These stunning results lead us to believe that regression collectives can pro- 
vide valuable insights on a wide range of prediction problems. Finally, these 
same results demonstrate that COBRA has remarkable speed in terms of CPU 
timings. In the context of high-dimensional or genomic data, such velocity is 
critical, and in fact COBRA can natively take advantage of multi-core parallel 
environments. 

The paper is organized as follows. In Section 2, we describe the combined 
estimator — the regression collective — and derive a non-asymptotic risk bound. 
Next we present the main result, that the collective is asymptotically at least 
as good as any of the basic estimators. Section 3 is devoted to the companion 
R package COBRA and presents benchmarks of its excellent performance on 
both simulated and real data sets. We also show that COBRA compares favor- 
ably with the SL, the SuperLearner R package, in that it performs similarly 
in most situations, much better in some, while it is consistently much faster 
in every case. Finally, for ease of exposition, proofs are collected in Section 4. 

2 The combined estimator 
2.1 Notation 

Throughout the article, we assume that we are given a training sample denoted
by $\mathcal{D}_n = ((\mathbf{X}_1, Y_1), \dots, (\mathbf{X}_n, Y_n))$ composed of i.i.d. random variables taking
their values in $\mathbb{R}^d \times \mathbb{R}$, distributed as an independent prototype pair $(\mathbf{X}, Y)$
satisfying $\mathbb{E} Y^2 < \infty$ (with the notation $\mathbf{X} = (X_1, \dots, X_d)$). The space $\mathbb{R}^d$
is equipped with the standard Euclidean metric. For fixed $\mathbf{x} \in \mathbb{R}^d$, our goal
is to consistently estimate the regression function $r^\star(\mathbf{x}) = \mathbb{E}[Y \mid \mathbf{X} = \mathbf{x}]$ using
the data $\mathcal{D}_n$.

3 http://cran.r-project.org/web/packages/COBRA/index.html
To begin with, the original data set $\mathcal{D}_n$ is split into two data sequences
$\mathcal{D}_k = ((\mathbf{X}_1, Y_1), \dots, (\mathbf{X}_k, Y_k))$ and $\mathcal{D}_\ell = ((\mathbf{X}_{k+1}, Y_{k+1}), \dots, (\mathbf{X}_n, Y_n))$, with
$\ell = n - k \ge 1$. For ease of notation, the elements of $\mathcal{D}_\ell$ are renamed
$((\mathbf{X}_1, Y_1), \dots, (\mathbf{X}_\ell, Y_\ell))$. There is a slight abuse of notation here, as the same
letter is used for both subsets $\mathcal{D}_k$ and $\mathcal{D}_\ell$. However, this should not cause any
trouble since the context is clear.

Now, suppose that we are given a collection of $M \ge 1$ competing candidates
$r_{k,1}, \dots, r_{k,M}$ to estimate $r^\star$.

These basic estimators, or basic machines, are assumed to be generated using
only the first subsample $\mathcal{D}_k$. These machines can be any among the
researcher's favorite tool kit, such as linear regression, kernel smoothers,
SVM, Lasso, neural networks, naive Bayes, or random forests. They could equally well
be any ad hoc regression rules suggested by the experimental context. The
essential idea is that these basic machines can be parametric or nonparametric,
or indeed semi-parametric, with possible tuning rules. All that is
asked for is that each of the $r_{k,m}(\mathbf{x})$, $m = 1, \dots, M$, is able to provide an
estimation of $r^\star(\mathbf{x})$ on the basis of $\mathcal{D}_k$ alone. Thus, any collection of
model-based or model-free machines is allowed, and the collection is here called
the regression collective.

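Given the collection of basic machines $\mathbf{r}_k = (r_{k,1}, \dots, r_{k,M})$, we define the
collective estimator $T_n$ to be

$$T_n(\mathbf{r}_k(\mathbf{x})) = \sum_{i=1}^{\ell} W_{n,i}(\mathbf{x}) Y_i, \quad \mathbf{x} \in \mathbb{R}^d,$$

where the random weights $W_{n,i}(\mathbf{x})$ take the form

$$W_{n,i}(\mathbf{x}) = \frac{\mathbf{1}_{\bigcap_{m=1}^{M} \{|r_{k,m}(\mathbf{x}) - r_{k,m}(\mathbf{X}_i)| \le \varepsilon_\ell\}}}{\sum_{j=1}^{\ell} \mathbf{1}_{\bigcap_{m=1}^{M} \{|r_{k,m}(\mathbf{x}) - r_{k,m}(\mathbf{X}_j)| \le \varepsilon_\ell\}}}. \tag{2.1}$$

In this definition, $\varepsilon_\ell$ is some positive parameter and, by convention, 0/0 = 0.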

The weighting scheme used in our regression collective is distinctive but not
obvious. Starting from Györfi et al. (2002), we see that $T_n$ is a local averaging
estimate in the following sense: the value for $r^\star(\mathbf{x})$, that is, the estimated
outcome at the query point $\mathbf{x}$, is the unweighted average over those $Y_i$'s such
that $\mathbf{X}_i$ is "close" to the query point. More precisely, "close" means that the
output at the query point, generated from each basic machine, is within an
$\varepsilon_\ell$-distance of the output generated by the same basic machine at $\mathbf{X}_i$
in the training data. If, for every basic machine, the output at $\mathbf{X}_i$ is close to the
output at the query point $\mathbf{x}$, then the corresponding outcome
$Y_i$ is included in the average, and not otherwise. Also, as a further note of
clarification: "closeness" of the $\mathbf{X}_i$'s is not here in the Euclidean sense of close
to any other point in the training data, or of the query point to any point
in the training data. It refers to closeness of the basic machine outputs at
the query point compared with basic machine outputs over all points in the
training data. Training points $\mathbf{X}_i$ whose basic machine outputs are close
to the corresponding basic machine outputs at the query point contribute to
the indicator function for the corresponding outcome $Y_i$.

In this context, $\varepsilon_\ell$ plays the role of a smoothing parameter: put differently,
in order to retain $Y_i$, all basic estimators $r_{k,1}, \dots, r_{k,M}$ have to deliver
predictions for the query point $\mathbf{x}$ which are in an $\varepsilon_\ell$-neighborhood of the predictions
$r_{k,1}(\mathbf{X}_i), \dots, r_{k,M}(\mathbf{X}_i)$. Note that the greater $\varepsilon_\ell$, the more tolerant the
process. It turns out that the practical performance of $T_n$ strongly relies on an
appropriate choice of $\varepsilon_\ell$. This important question will be thoroughly discussed
in Section 3, where we devise an automatic (i.e., data-dependent) selection
strategy of $\varepsilon_\ell$.

Next, we note that the subscript $n$ in $T_n$ may be a little confusing, since $T_n$ is
a weighted average of the $Y_i$'s in $\mathcal{D}_\ell$ only. However, $T_n$ depends on the entire
data set $\mathcal{D}_n$, as the rest of the data is used to set up the original machines
$r_{k,1}, \dots, r_{k,M}$. Finally, and most importantly, it should be noticed that the
combined estimate $T_n$ is nonlinear with respect to the basic estimators $r_{k,m}$.
This makes it very different from model selection, convex and linear
aggregation. As such, it is inspired by the preliminary work of Mojirsheibani
(1999) in the supervised classification context. It is also close in spirit to
the "Super Learner" strategy developed by van der Laan et al. (2007), as
mentioned earlier.

Let us finally mention that, in the weights definition (2.1), all original
estimators are asked to have the same opinion on the importance of the observation
$\mathbf{X}_i$ (within the range of $\varepsilon_\ell$) for the corresponding $Y_i$ to be integrated in the
combination $T_n$. However, this unanimity constraint may be relaxed by
imposing, for example, that a fixed fraction $\alpha \in (0, 1]$ of the machines agree on
the importance of $\mathbf{X}_i$. In that case, the weights take the more sophisticated
form

$$W_{n,i}(\mathbf{x}) = \frac{\mathbf{1}_{\{\sum_{m=1}^{M} \mathbf{1}_{\{|r_{k,m}(\mathbf{x}) - r_{k,m}(\mathbf{X}_i)| \le \varepsilon_\ell\}} \ge M\alpha\}}}{\sum_{j=1}^{\ell} \mathbf{1}_{\{\sum_{m=1}^{M} \mathbf{1}_{\{|r_{k,m}(\mathbf{x}) - r_{k,m}(\mathbf{X}_j)| \le \varepsilon_\ell\}} \ge M\alpha\}}}.$$

It turns out that adding the parameter $\alpha$ does not change the asymptotic
properties of $T_n$, provided $\alpha \to 1$. Thus, to keep a sufficient degree of clarity
in the mathematical statements and subsequent proofs, we have decided to
consider only the case $\alpha = 1$ (i.e., unanimity). We leave as an exercise
the possibility to extend the results to more general values of $\alpha$. On the
other hand, as highlighted by Section 3, $\alpha$ has a non-negligible impact on
the performance of the combined estimate. Accordingly, we will discuss in
Section 3 an automatic procedure to select this extra parameter.
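
To see how the unanimity weights and their relaxed version with a fraction $\alpha$ differ in practice, here is a minimal Python sketch. It assumes the machine outputs have already been computed and stored in plain lists; the function names `cobra_weights` and `cobra_predict` and the numerical values are hypothetical and are not taken from the COBRA package.

```python
def cobra_weights(train_preds, query_preds, eps, alpha=1.0):
    """Weights W_{n,i}(x) for one query point.

    train_preds[i][m] is the output of machine m at training point X_i,
    query_preds[m] is the output of machine m at the query point x.
    """
    n_machines = len(query_preds)
    keep = []
    for row in train_preds:
        # Number of machines whose outputs at X_i and at x differ by at most eps.
        votes = sum(abs(q - p) <= eps for q, p in zip(query_preds, row))
        # Tiny tolerance so that M * alpha is not missed because of rounding.
        keep.append(votes >= n_machines * alpha - 1e-12)
    total = sum(keep)
    # Convention 0/0 = 0: every weight vanishes when no point is retained.
    return [k / total if total else 0.0 for k in keep]


def cobra_predict(train_preds, responses, query_preds, eps, alpha=1.0):
    """Collective prediction: weighted sum of the responses Y_i."""
    weights = cobra_weights(train_preds, query_preds, eps, alpha)
    return sum(w * y for w, y in zip(weights, responses))


# Made-up outputs of M = 3 machines at three training points and at a query.
train_preds = [[1.0, 2.0, 3.0], [1.1, 2.1, 9.0], [5.0, 5.0, 5.0]]
responses = [10.0, 20.0, 30.0]
query_preds = [1.0, 2.0, 3.0]

unanimous = cobra_predict(train_preds, responses, query_preds, eps=0.25)
relaxed = cobra_predict(train_preds, responses, query_preds, eps=0.25, alpha=2 / 3)
```

With unanimity only the first point is retained. With $\alpha = 2/3$ the second point is retained as well, because two of its three machine outputs agree with the query.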
2.2 Theoretical performance

This section is devoted to the study of some asymptotic and non-asymptotic
properties of the combined estimate $T_n$, whose quality will be assessed by the
quadratic risk

$$\mathbb{E}|T_n(\mathbf{r}_k(\mathbf{X})) - r^\star(\mathbf{X})|^2.$$

Here and later, $\mathbb{E}$ denotes the expectation with respect to both $\mathbf{X}$ and the
sample $\mathcal{D}_n$. Throughout, we let

$$T(\mathbf{r}_k(\mathbf{X})) = \mathbb{E}[Y \mid \mathbf{r}_k(\mathbf{X})]$$

and note that, by the very definition of the $L^2$ conditional expectation,

$$\mathbb{E}|T(\mathbf{r}_k(\mathbf{X})) - Y|^2 \le \inf_f \mathbb{E}|f(\mathbf{r}_k(\mathbf{X})) - Y|^2, \tag{2.2}$$

where the infimum is taken over all square integrable functions of $\mathbf{r}_k(\mathbf{X})$.

Our first result is a non-asymptotic inequality, which states that the combined
estimator behaves as well as the best one in the original list, within a term
measuring how far $T_n$ is from $T$.

Theorem 2.1. Let $\mathbf{r}_k = (r_{k,1}, \dots, r_{k,M})$ be the collection of basic estimators,
and let $T_n(\mathbf{r}_k(\mathbf{x}))$ be the combined estimate. Then

$$\mathbb{E}|T_n(\mathbf{r}_k(\mathbf{X})) - r^\star(\mathbf{X})|^2 \le \min_{m=1,\dots,M} \mathbb{E}|r_{k,m}(\mathbf{X}) - r^\star(\mathbf{X})|^2 + \mathbb{E}|T_n(\mathbf{r}_k(\mathbf{X})) - T(\mathbf{r}_k(\mathbf{X}))|^2,$$

for all distributions of $(\mathbf{X}, Y)$ with $\mathbb{E} Y^2 < \infty$.

Theorem 2.1 reassures us on the performance of $T_n$ with respect to the basic
machines, whatever the distribution of $(\mathbf{X}, Y)$ is and regardless of which
individual estimate is actually the best. The term $\mathbb{E}|T_n(\mathbf{r}_k(\mathbf{X})) - T(\mathbf{r}_k(\mathbf{X}))|^2$ is
a variance-type term, which can be asymptotically controlled.

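Proposition 2.1. Assume that

$$\varepsilon_\ell \to 0 \quad \text{and} \quad \ell \varepsilon_\ell^M \to \infty \quad \text{as } \ell \to \infty.$$

Then

$$\mathbb{E}|T_n(\mathbf{r}_k(\mathbf{X})) - T(\mathbf{r}_k(\mathbf{X}))|^2 \to 0 \quad \text{as } \ell \to \infty,$$

for all distributions of $(\mathbf{X}, Y)$ with $\mathbb{E} Y^2 < \infty$.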
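
As a quick check of how the two conditions interact (reading them as $\varepsilon_\ell \to 0$ and $\ell \varepsilon_\ell^M \to \infty$), consider a polynomial threshold $\varepsilon_\ell = \ell^{-\beta}$ with $\beta > 0$. This parametric form is an assumption made here purely for illustration; it is not a choice prescribed by the proposition. Then

$$\varepsilon_\ell = \ell^{-\beta} \to 0 \quad \text{and} \quad \ell \varepsilon_\ell^M = \ell^{1 - \beta M} \to \infty \iff \beta < 1/M,$$

so any exponent $0 < \beta < 1/M$ satisfies both requirements. With $M = 4$ machines, for instance, $\varepsilon_\ell = \ell^{-1/5}$ qualifies, whereas $\varepsilon_\ell = \ell^{-1/2}$ shrinks too fast, since $\ell \varepsilon_\ell^4 = \ell^{-1} \to 0$. The larger the collective, the more slowly the threshold is allowed to shrink.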

Thus, combining Theorem 2.1 and Proposition 2.1, we obtain

$$\limsup_{\ell \to \infty} \mathbb{E}|T_n(\mathbf{r}_k(\mathbf{X})) - r^\star(\mathbf{X})|^2 \le \min_{m=1,\dots,M} \mathbb{E}|r_{k,m}(\mathbf{X}) - r^\star(\mathbf{X})|^2.$$

This result is remarkable, for at least two reasons. Firstly, it shows that, in
terms of predictive quadratic risk, the combined estimate does asymptotically 
at least as well as the best primitive machine. Secondly, the result is universal, 
in the sense that it is true for all distributions of (X, Y), without exceptions. 
This is especially interesting because the performance of any estimation pro- 
cedure eventually depends upon some model and smoothness assumptions on 
the observations. For example, a linear regression fit performs well if the 
distribution is truly linear, but may behave poorly otherwise. Similarly, the 
Lasso procedure is known to do a good job for non-correlated designs (see 
van de Geer, 2008), with no clear guarantee however in adversarial situations.
Likewise, rates of convergence of nonparametric procedures such as
the $k$-nearest neighbor method, kernel estimates and random forests dramatically
deteriorate as the ambient dimension increases, but may be significantly
improved if the true underlying dimension is reasonable. This phenomenon 
is thoroughly analyzed for the random forests algorithm in Biau (2012). The 
crux is that model and smoothness assumptions are usually unverifiable, espe- 
cially in modern, high-dimensional and large scale data sets. To circumvent 
this difficulty, people often try many different methods and retain the one 
exhibiting the best empirical results. Our aggregation strategy offers a nice 
alternative, in the sense that if one of the initial estimators is consistent for
a given smoothness class $\mathcal{M}$ of distributions, then $T_n$ inherits the same
property. Our procedure therefore allows the statistician to consider model-free
prediction. This is formalized in the following corollary.

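Corollary 2.1. Assume that one of the original estimators, say $r_{k,m_0}$, satisfies

$$\mathbb{E}|r_{k,m_0}(\mathbf{X}) - r^\star(\mathbf{X})|^2 \to 0 \quad \text{as } k \to \infty$$

for all distributions of $(\mathbf{X}, Y)$ in some smoothness class $\mathcal{M}$. Then, if

$$\varepsilon_\ell \to 0 \quad \text{and} \quad \ell \varepsilon_\ell^M \to \infty \quad \text{as } \ell \to \infty,$$

one has

$$\mathbb{E}|T_n(\mathbf{r}_k(\mathbf{X})) - r^\star(\mathbf{X})|^2 \to 0 \quad \text{as } k, \ell \to \infty,$$

for all distributions of $(\mathbf{X}, Y)$ in $\mathcal{M}$.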

3 Implementation and numerical studies 

This section is devoted to the implementation of the described method. Its ex- 
cellent performance is then assessed in a series of benchmarks. The companion 
R package COBRA (standing for COmBined Regression Alternative) is available
on the CRAN website http://cran.r-project.org/web/packages/COBRA/index.html,
for Linux, Mac and Windows platforms, see Guedj (2013).
COBRA includes a parallel option, allowing for improved performance on 
multi-core computers (see Knaus, 2010). 
As raised in the previous section, a fine calibration of the smoothing
parameter $\varepsilon_\ell$ is crucial. Clearly, too small a value will discard many machines and
most weights will be zero. Conversely, a large value sets all weights to $1/\Sigma$
with

$$\Sigma = \sum_{j=1}^{\ell} \mathbf{1}_{\bigcap_{m=1}^{M} \{|r_{k,m}(\mathbf{x}) - r_{k,m}(\mathbf{X}_j)| \le \varepsilon_\ell\}},$$

giving the naive predictor that does not take into account any new data point
and predicts the mean over the sample $\mathcal{D}_\ell$. We also consider a relaxed version of
the unanimity constraint: instead of requiring global agreement over the
implemented machines, consider some $\alpha \in (0, 1]$ and keep observation $Y_i$ in the
construction of $T_n$ if and only if at least a proportion $\alpha$ of the machines agree
on the importance of $\mathbf{X}_i$. This parameter requires a fine calibration as well.
To understand better, consider the following toy example: on some data set,
assume all machines but one have nice predictive performance. For any new
data point, requiring global agreement will fail since the pool of machines is
heterogeneous. In this regard, $\alpha$ should be seen as a measure of homogeneity:
if a small value is selected, it should be seen as an indicator that some
machines perform (possibly much) better than some others. Conversely, a large
value indicates that the predictive abilities of the machines are close.

A natural measure of the risk in the prediction context is the empirical
quadratic loss, namely

$$\hat{R}(\hat{\mathbf{Y}}) = \frac{1}{\ell} \sum_{j=1}^{\ell} (\hat{Y}_j - Y_j)^2,$$

where $\hat{\mathbf{Y}} = (\hat{Y}_1, \dots, \hat{Y}_\ell)$ is the vector of predicted values for the responses
$Y_1, \dots, Y_\ell$.

We adopted the following protocol: using a simple data-splitting device, $\varepsilon_\ell$
and $\alpha$ are chosen by minimizing the empirical risk $\hat{R}$ over the set
$\{\varepsilon_{\ell,\min}, \dots, \varepsilon_{\ell,\max}\} \times \{1/M, \dots, 1\}$, where $\varepsilon_{\ell,\min} = 10^{-9}$ and $\varepsilon_{\ell,\max}$ is the largest difference between
two predictions of the pool of machines. In the package, the grid size $\#\{\varepsilon_{\ell,\min}, \dots, \varepsilon_{\ell,\max}\}$
may be modified by the user, otherwise the default value 100 is chosen.
Figure 2 illustrates the discussion about the choice of $\varepsilon_\ell$ and $\alpha$.
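
The calibration protocol above can be sketched as a plain grid search. The Python code below is a hedged illustration, not the COBRA implementation: the function names, the evenly spaced grid for the threshold, the reading of the largest threshold as the range of all machine outputs, and the way the data are split into a "fit" part and a "validation" part are all assumptions made for this example.

```python
def collective(train_preds, responses, query_preds, eps, alpha):
    """Collective prediction at one query point (0/0 = 0 if nothing is kept)."""
    m = len(query_preds)
    kept = [
        y for row, y in zip(train_preds, responses)
        if sum(abs(q - p) <= eps for q, p in zip(query_preds, row)) >= m * alpha - 1e-12
    ]
    return sum(kept) / len(kept) if kept else 0.0


def quadratic_risk(predicted, observed):
    """Empirical quadratic loss between predictions and responses."""
    return sum((p - y) ** 2 for p, y in zip(predicted, observed)) / len(observed)


def calibrate(fit_preds, fit_y, val_preds, val_y, n_eps=100, eps_min=1e-9):
    """Return (risk, eps, alpha) minimising the validation risk over the grid."""
    n_machines = len(fit_preds[0])
    outputs = [p for row in fit_preds + val_preds for p in row]
    # Assumed reading of "largest difference between two predictions".
    eps_max = max(outputs) - min(outputs)
    step = (eps_max - eps_min) / (n_eps - 1)
    eps_grid = [eps_min + i * step for i in range(n_eps)]
    alpha_grid = [j / n_machines for j in range(1, n_machines + 1)]
    best = None
    for eps in eps_grid:
        for alpha in alpha_grid:
            fitted = [collective(fit_preds, fit_y, q, eps, alpha) for q in val_preds]
            risk = quadratic_risk(fitted, val_y)
            if best is None or risk < best[0]:
                best = (risk, eps, alpha)
    return best


# Made-up outputs of two machines on four "fit" and two "validation" points.
fit_preds = [[0.0, 0.0], [0.1, 0.1], [1.0, 1.0], [1.1, 1.1]]
fit_y = [0.0, 0.0, 1.0, 1.0]
val_preds = [[0.05, 0.05], [1.05, 1.05]]
val_y = [0.0, 1.0]

best_risk, best_eps, best_alpha = calibrate(fit_preds, fit_y, val_preds, val_y)
```

The search costs one pass over the retained sample per grid cell and per validation point, so its cost grows linearly with the grid size.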

By default, COBRA includes the following classical packages dealing with re- 
gression estimation and prediction. However, note that the user has the choice 
to modify this list to her/his own convenience. 

• Lasso (R package lars, see Hastie and Efron, 2012), 

• Ridge regression (R package ridge, see Cule, 2012), 

• k-nearest neighbors (R package FNN, see Li, 2012),

• CART algorithm (R package tree, see Ripley, 2012), 
• Random Forest algorithm (R package randomForest, see Liaw and
Wiener, 2002).

First, COBRA is benchmarked on synthetic data. For each of the following
eight models, two designs are considered: uniform over $(-1, 1)^d$ (referred to
as "Uncorrelated" in Table 1, Table 2 and Table 3), and Gaussian with mean
0 and covariance matrix $\Sigma$ with $\Sigma_{ij} = 2^{-|i-j|}$ ("Correlated"). Models considered
cover a wide spectrum of contemporary regression problems. Indeed, Model 2
comes from van der Laan et al. (2007), Model 3 and Model 4 appear in Meier
et al. (2009). Model 1 and Model 5 are classic settings. Model 6 is about
predicting labels, Model 7 is inspired by high-dimensional sparse regression
problems. Finally, Model 8 deals with probability estimation, linking with
nonparametric model-free approaches such as in Malley et al. (2012). In the
sequel, we let $\mathcal{N}(\mu, \sigma^2)$ denote a Gaussian random variable with mean $\mu$ and
variance $\sigma^2$. In the simulations, the training data set was usually set to 80%
of the whole sample, then split into two equal parts corresponding to $\mathcal{D}_k$ and
$\mathcal{D}_\ell$.
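
For readers who want to reproduce the two designs and the data split, the following Python sketch may help. It is written under explicit assumptions: the covariance is read as $\Sigma_{ij} = 2^{-|i-j|}$, the correlated design is drawn with a first-order autoregressive recursion (which produces exactly this covariance for unit-variance Gaussian coordinates), and all function names are hypothetical.

```python
import math
import random


def covariance(d):
    """Matrix with entries 2^{-|i-j|}, the assumed 'Correlated' covariance."""
    return [[2.0 ** (-abs(i - j)) for j in range(d)] for i in range(d)]


def uniform_design(n, d, rng):
    """n points drawn uniformly over (-1, 1)^d (the 'Uncorrelated' design)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(n)]


def correlated_design(n, d, rng, rho=0.5):
    """n centred Gaussian points with Cov(X_i, X_j) = rho^{|i-j|}.

    Uses the recursion X_j = rho * X_{j-1} + sqrt(1 - rho^2) * Z_j, which keeps
    unit variances and gives a correlation of rho^h at lag h.
    """
    scale = math.sqrt(1.0 - rho * rho)
    rows = []
    for _ in range(n):
        row = [rng.gauss(0.0, 1.0)]
        for _ in range(d - 1):
            row.append(rho * row[-1] + scale * rng.gauss(0.0, 1.0))
        rows.append(row)
    return rows


def split_sizes(n, train_fraction=0.8):
    """Sizes of D_k, D_l and of the testing set for a sample of size n."""
    n_train = int(round(train_fraction * n))
    k = n_train // 2
    return k, n_train - k, n - n_train


rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility
x_uncorrelated = uniform_design(5, 3, rng)
x_correlated = correlated_design(5, 3, rng)
sizes = split_sizes(800)
```

For a sample of size 800, this split gives 320 observations for each of the two training halves and 160 for testing.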

Model 1. $n = 800$, $d = 50$, $Y = X_1^2 + \exp(-X_2^2)$.

Model 2. $n = 600$, $d = 100$, $Y = X_1 X_2 + X_3^2 - X_4 X_7 + X_8 X_{10} - X_6^2 + \mathcal{N}(0, 0.5)$.

Model 3. $n = 600$, $d = 100$, $Y = -\sin(2X_1) + X_2^2 + X_3 - \exp(-X_4) + \mathcal{N}(0, 0.5)$.

Model 4. $n = 600$, $d = 100$, $Y = X_1 + (2X_2 - 1)^2 + \sin(2\pi X_3)/(2 - \sin(2\pi X_3)) + \sin(2\pi X_4) + 2\cos(2\pi X_4) + 3\sin^2(2\pi X_4) + 4\cos^2(2\pi X_4) + \mathcal{N}(0, 0.5)$.

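Model 5. $n = 700$, $d = 20$, $Y = \mathbf{1}_{\{X_1 > 0\}} + X_2^3 + \mathbf{1}_{\{X_4 + X_6 - X_8 - X_9 > 1 + X_{14}\}} + \exp(-X_2^2) + \mathcal{N}(0, 0.5)$.

Model 6. $n = 500$, $d = 30$, $Y = \sum_{k=1}^{10} \mathbf{1}_{\{X_k^3 < 0\}} - \mathbf{1}_{\{\mathcal{N}(0,1) > 1.25\}}$.

Model 7. $n = 600$, $d = 300$, $Y = X_1^2 + X_2^2 X_3 \exp(-|X_4|) + X_6 - X_8 + \mathcal{N}(0, 0.5)$.

Model 8. $n = 600$, $d = 50$, $Y = \mathbf{1}_{\{X_1 + X_4^3 + X_9 + \sin(X_{12} X_{18}) + \mathcal{N}(0, 0.1) > 0.38\}}$.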

Table 1 presents the mean quadratic error and standard deviation over 100
independent replications, for each model and design. Bold numbers identify
the lowest error, i.e., the best competitor. Boxplots of errors are presented
in Figure 3 and Figure 4. Further, Figure 5 and Figure 6 show the predictive
capacities of COBRA, and Figure 7 depicts its ability to reconstruct the
functional dependence over the covariates when this dependence is additive,
illustrating the striking performance of our approach in a wide spectrum of
statistical settings. A remarkable fact is that COBRA performs at least as well as
the best machine, and even improves on it significantly in Model 3, Model 5 and
Model 6.
Next, we compare COBRA to the SuperLearner algorithm (Polley and van der
Laan, 2012). This widespread algorithm was first described in van der Laan
et al. (2007). SuperLearner is used in this section as the key competitor
to our method: in a spirit close to ours, its main idea lies in a nonlinear
way to combine basic estimators based on cross-validation. We feel close to
the approach used in the SuperLearner package, which allows the user to add as
many machines as desired, then blends them to deliver predictive outcomes.

Table 2 summarizes the performance of COBRA and SuperLearner (used with
SL.randomForest, SL.ridge and SL.glmnet, so that both methods compete
on equal terms) through the described protocol. Both methods perform
similarly in most models, although COBRA proves much more efficient on
correlated design in Model 2 and Model 4. This already remarkable result
is reinforced by the flexibility and speed shown by COBRA. Indeed,
as emphasized in Table 3, without even using the parallel option, COBRA
obtains similar or better results than SuperLearner roughly five times faster.

Next, COBRA is used to process the following real-life data sets. 

• Concrete Slump Test 4 (see Yeh, 2007), 

• Concrete Compressive Strength 5 (see Yeh, 1998), 

• Wine Quality 6 (see Cortez et al., 2009). Note that this data set involves 
supervised classification and opens a line for future research since COBRA 
is mainly devoted to regression. 

The good predictive performance of COBRA is summarized in Figure 8 and 
errors are presented in Figure 9. For every data set, the sample is divided 
into a training set (90%) and a testing set (10%) on which the predictive 
performance is evaluated. 

As a conclusion to this thorough experimental protocol, COBRA sets a new 
gold standard for prediction-oriented problems in the context of regression. 



4 http://archive.ics.uci.edu/ml/datasets/Concrete+Slump+Test.

5 http://archive.ics.uci.edu/ml/datasets/Concrete+Compressive+Strength.

6 http://archive.ics.uci.edu/ml/datasets/Wine+Quality.






Table 1: Quadratic errors of the implemented machines and COBRA. Means 
and standard deviations over 100 independent replications. 
Uncorrelated          lars       ridge      fnn        tree       rf         COBRA
...
Model 8     m.        ...        0.1279     0.2243     0.1715     0.1236     0.1021
            sd.       0.0120     0.0161     0.0189     0.0270     0.0100     0.0155

Correlated            lars       ridge      fnn        tree       rf         COBRA
Model 1     m.        2.3736     1.9785     2.0958     0.3312     0.5766     0.3301
            sd.       0.4108     0.3538     0.3414     0.1285     0.1914     0.1239
Model 2     m.        8.1710     4.0071     4.3892     1.3609     1.4768     1.3612
            sd.       1.5532     0.6840     0.7190     0.4647     0.4415     0.4654
Model 3     m.        6.1448     6.0185     8.2154     4.3175     4.0177     3.7917
            sd.       11.9450    12.0861    13.3121    11.7386    12.4160    11.1806
Model 4     m.        60.5795    42.2117    51.7293    9.6810     14.7731    9.6906
            sd.       11.1303    9.8207     10.9351    3.9807     5.9508     3.9872
Model 5     m.        6.2325     7.1762     10.1254    3.1525     4.2289     2.1743
            sd.       2.4320     3.5448     3.1190     2.1468     2.4826     1.6640
Model 6     m.        1.2765     1.5307     2.5230     2.6185     1.2027     0.9925
            sd.       0.1381     0.9593     0.2762     0.3445     0.1600     0.1210
Model 7     m.        20.8575    4.4367     5.8893     3.6865     2.7318     2.9127
            sd.       7.1821     1.0770     1.2226     1.0139     0.8945     0.9072
Model 8     m.        0.1366     0.1308     0.2267     0.1701     0.1226     0.0984
            sd.       0.0127     0.0143     0.0179     0.0302     0.0102     0.0144
Table 2: Quadratic errors of SuperLearner and COBRA. Means and standard
deviations over 100 independent replications.

Uncorr.               SL           COBRA
Model 1     m.        0.0541       0.0320
            sd.       [illegible]  [illegible]
Model 2     m.        0.1765       0.3569
            sd.       0.0167       0.8797
Model 3     m.        0.2081       0.2573
            sd.       0.0282       0.0699
Model 4     m.        4.3114       3.7464
            sd.       0.4138       0.8746
Model 5     m.        0.2119       0.2187
            sd.       0.0317       0.0427
Model 6     m.        0.7627       1.0220
            sd.       0.1023       0.3347
Model 7     m.        0.1705       0.3103
            sd.       0.0260       0.0490
Model 8     m.        0.1081       0.1075
            sd.       0.0121       0.0235

Corr.                 SL           COBRA
Model 1     m.        0.8733       0.3262
            sd.       0.2740       0.1242
Model 2     m.        2.3391       1.3984
            sd.       0.4958       0.3804
Model 3     m.        3.1885       3.3201
            sd.       1.5101       1.8056
Model 4     m.        25.1073      9.3964
            sd.       7.3179       2.8953
Model 5     m.        5.6478       4.9990
            sd.       7.7271       9.3103
Model 6     m.        0.8967       1.1988
            sd.       0.1197       0.4573
Model 7     m.        3.0367       3.1401
            sd.       1.6225       1.6097
Model 8     m.        0.1116       0.1045
            sd.       0.0111       0.0216

Table 3: Average CPU-times in seconds. No parallelization. Means and
standard deviations over 10 independent replications.

Uncorr.               SL           COBRA
Model 1     m.        53.92        10.92
            sd.       1.42         0.29
Model 2     m.        57.96        11.90
            sd.       0.95         0.31
Model 3     m.        53.70        10.66
            sd.       0.55         0.11
Model 4     m.        55.00        11.15
            sd.       0.74         0.18
Model 5     m.        28.46        5.01
            sd.       0.73         0.06
Model 6     m.        22.97        3.99
            sd.       0.27         0.05
Model 7     m.        127.80       35.67
            sd.       5.69         1.91
Model 8     m.        32.98        6.46
            sd.       1.33         0.33

Corr.                 SL           COBRA
Model 1     m.        61.92        11.96
            sd.       1.85         0.27
Model 2     m.        70.90        14.16
            sd.       2.47         0.57
Model 3     m.        59.91        11.92
            sd.       2.06         0.41
Model 4     m.        63.58        13.11
            sd.       1.21         0.34
Model 5     m.        31.24        5.02
            sd.       0.86         0.07
Model 6     m.        24.29        4.12
            sd.       0.82         0.15
Model 7     m.        145.18       41.28
            sd.       8.97         2.84
Model 8     m.        31.31        6.24
            sd.       0.73         0.11
Figure 2: Examples of calibration of parameters $\varepsilon_\ell$ and $\alpha$. The bold point is
the minimum.

(a) Model 3, correlated design. (b) Model 4, uncorrelated design.
(c) Model 5, uncorrelated design. (d) Model 5, correlated design.
Figure 3: Boxplots of quadratic errors, uncorrelated design. From left to
right: lars, ridge, fnn, tree, randomForest, COBRA. 

(a) Model 1. (b) Model 2. (c) Model 3. (d) Model 4. 




(e) Model 5. (f) Model 6. (g) Model 7. (h) Model 8. 




Figure 4: Boxplots of quadratic errors, correlated design. From left to right: 
lars, ridge, fnn, tree, randomForest, COBRA. 

(a) Model 1. (b) Model 2. (c) Model 3. (d) Model 4. 




(e) Model 5. (f) Model 6. (g) Model 7. (h) Model 8. 
Figure 5: Prediction over the testing set, uncorrelated design. The more
points on the first bisector, the better the prediction.

(a) Model 1. (b) Model 2. (c) Model 3. (d) Model 4.

(e) Model 5. (f) Model 6. (g) Model 7. (h) Model 8.
Figure 6: Prediction over the testing set, correlated design. The more points
on the first bisector, the better the prediction.

(a) Model 1. (b) Model 2. (c) Model 3. (d) Model 4.

(e) Model 5. (f) Model 6. (g) Model 7. (h) Model 8.
Figure 7: Examples of reconstruction of the functional dependencies, for
covariates 1 to 4.

(a) Model 1, uncorrelated design. (b) Model 1, correlated design.
(c) Model 3, uncorrelated design. (d) Model 3, correlated design.
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Figure 8: Prediction over the testing set, real-life data sets. 



(a) Concrete Slump(b) Concrete Com-(c) Wine Quality, red(d) Wine Quality. 
Test. pressive Strength. wine. white wine. 



Figure 9: Boxplot of quadratic errors, real-life data sets. 



(a) Concrete Slump(b) Concrete Com-(c) Wine Quality, red(d) Wine Quality, 
Test. pressive Strength. wine. white wine. 
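4 Proofs 

4.1 Proof of Theorem 2.1 

For each $m = 1, \dots, M$, we have 

$$
\begin{aligned}
0 &\le \mathbb{E}|r_{k,m}(\mathbf{X}) - Y|^2 - \mathbb{E}|T(\mathbf{r}_k(\mathbf{X})) - Y|^2 \\
&= \mathbb{E}|r_{k,m}(\mathbf{X}) - Y|^2 - \mathbb{E}|r^\star(\mathbf{X}) - Y|^2 + \mathbb{E}|r^\star(\mathbf{X}) - Y|^2 \\
&\quad - \mathbb{E}|T_n(\mathbf{r}_k(\mathbf{X})) - Y|^2 + \mathbb{E}|T_n(\mathbf{r}_k(\mathbf{X})) - Y|^2 - \mathbb{E}|T(\mathbf{r}_k(\mathbf{X})) - Y|^2, \qquad (4.1)
\end{aligned}
$$

where we used that $\mathbb{E}|T(\mathbf{r}_k(\mathbf{X})) - Y|^2 \le \inf_f \mathbb{E}|f(\mathbf{r}_k(\mathbf{X})) - Y|^2$. Observe 
now that 

$$\mathbb{E}|r_{k,m}(\mathbf{X}) - Y|^2 = \mathbb{E}|r_{k,m}(\mathbf{X}) - r^\star(\mathbf{X})|^2 + \mathbb{E}|r^\star(\mathbf{X}) - Y|^2, \qquad (4.2)$$

since 

$$
\begin{aligned}
\mathbb{E}[(r_{k,m}(\mathbf{X}) - r^\star(\mathbf{X}))(r^\star(\mathbf{X}) - Y)]
&= \mathbb{E}\big[\mathbb{E}[(r_{k,m}(\mathbf{X}) - r^\star(\mathbf{X}))(r^\star(\mathbf{X}) - Y) \mid \mathcal{D}_k, \mathbf{X}]\big] \\
&= \mathbb{E}\big[(r_{k,m}(\mathbf{X}) - r^\star(\mathbf{X}))\,\mathbb{E}[r^\star(\mathbf{X}) - Y \mid \mathbf{X}]\big] \\
&= \mathbb{E}[(r_{k,m}(\mathbf{X}) - r^\star(\mathbf{X}))(r^\star(\mathbf{X}) - r^\star(\mathbf{X}))] \\
&= 0.
\end{aligned}
$$

Likewise, 

$$\mathbb{E}|T_n(\mathbf{r}_k(\mathbf{X})) - Y|^2 = \mathbb{E}|T_n(\mathbf{r}_k(\mathbf{X})) - r^\star(\mathbf{X})|^2 + \mathbb{E}|r^\star(\mathbf{X}) - Y|^2$$

and 

$$\mathbb{E}|T_n(\mathbf{r}_k(\mathbf{X})) - Y|^2 = \mathbb{E}|T_n(\mathbf{r}_k(\mathbf{X})) - T(\mathbf{r}_k(\mathbf{X}))|^2 + \mathbb{E}|T(\mathbf{r}_k(\mathbf{X})) - Y|^2.$$

Combining these equalities reveals that the expression in (4.1) equals 

$$\mathbb{E}|r_{k,m}(\mathbf{X}) - r^\star(\mathbf{X})|^2 - \mathbb{E}|T_n(\mathbf{r}_k(\mathbf{X})) - r^\star(\mathbf{X})|^2 + \mathbb{E}|T_n(\mathbf{r}_k(\mathbf{X})) - T(\mathbf{r}_k(\mathbf{X}))|^2.$$

Since this expression is nonnegative, it follows that 

$$\mathbb{E}|T_n(\mathbf{r}_k(\mathbf{X})) - r^\star(\mathbf{X})|^2 \le \mathbb{E}|r_{k,m}(\mathbf{X}) - r^\star(\mathbf{X})|^2 + \mathbb{E}|T_n(\mathbf{r}_k(\mathbf{X})) - T(\mathbf{r}_k(\mathbf{X}))|^2.$$

Taking the infimum over $m = 1, \dots, M$ leads to 

$$\mathbb{E}|T_n(\mathbf{r}_k(\mathbf{X})) - r^\star(\mathbf{X})|^2 \le \min_{m=1,\dots,M} \mathbb{E}|r_{k,m}(\mathbf{X}) - r^\star(\mathbf{X})|^2 + \mathbb{E}|T_n(\mathbf{r}_k(\mathbf{X})) - T(\mathbf{r}_k(\mathbf{X}))|^2.$$

This is the desired result. 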
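
The orthogonality step behind identity (4.2) can be checked exactly on a small discrete model. The sketch below is only an illustration: the three-point distribution of X, the values of Y and the predictor `g` (standing in for a basic machine $r_{k,m}$) are all hypothetical choices, and exact rational arithmetic is used so that the cross term is exactly zero.

```python
from fractions import Fraction

# Hypothetical finite model: X is uniform on {0, 1, 2}; given X = x,
# Y takes each value listed in ys[x] with equal probability.
ys = {0: [0, 2], 1: [1, 5], 2: [4, 4]}
p_x = Fraction(1, len(ys))

def expect(h):
    """Exact expectation of h(x, y) under the joint law of (X, Y)."""
    return sum(p_x * Fraction(1, len(values)) * h(x, y)
               for x, values in ys.items() for y in values)

def r_star(x):
    """Regression function r*(x) = E[Y | X = x]."""
    return Fraction(sum(ys[x]), len(ys[x]))

def g(x):
    """An arbitrary predictor standing in for a basic machine."""
    return Fraction(3 * x, 2) + 1

# Left- and right-hand sides of the decomposition, and the cross term.
lhs = expect(lambda x, y: (g(x) - y) ** 2)
rhs = (expect(lambda x, y: (g(x) - r_star(x)) ** 2)
       + expect(lambda x, y: (r_star(x) - y) ** 2))
cross = expect(lambda x, y: (g(x) - r_star(x)) * (r_star(x) - y))
```

Here `cross` vanishes and `lhs` equals `rhs`, whatever predictor `g` is plugged in, because $r^\star(x) - Y$ has conditional mean zero given $X = x$.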
4.2 Proof of Proposition 2.1 

We start with a technical lemma, whose proof can be found in the monograph 
by Györfi et al. (2002). 

Lemma 4.1. Let $B(n,p)$ be a binomial random variable with parameters 
$n \ge 1$ and $p > 0$. Then 

$$\mathbb{E}\left[\frac{1}{1 + B(n,p)}\right] \le \frac{1}{p(n+1)}$$

and 

$$\mathbb{E}\left[\frac{\mathbf{1}_{\{B(n,p) > 0\}}}{B(n,p)}\right] \le \frac{2}{p(n+1)}.$$
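
Both bounds of Lemma 4.1 can be checked numerically by summing over the binomial probability mass function. The following sketch is an illustration only; the function names and the tested pairs $(n, p)$ are arbitrary choices, not part of the original text.

```python
from math import comb

def binomial_pmf(n, p, j):
    """P(B(n, p) = j)."""
    return comb(n, j) * p**j * (1 - p)**(n - j)

def mean_inverse_one_plus(n, p):
    """E[1 / (1 + B(n, p))], computed by direct summation."""
    return sum(binomial_pmf(n, p, j) / (1 + j) for j in range(n + 1))

def mean_inverse_positive(n, p):
    """E[1{B(n, p) > 0} / B(n, p)], computed by direct summation."""
    return sum(binomial_pmf(n, p, j) / j for j in range(1, n + 1))

# Compare both expectations with the bounds of the lemma.
checks = []
for n, p in [(1, 0.5), (10, 0.1), (50, 0.02), (200, 0.3)]:
    first_ok = mean_inverse_one_plus(n, p) <= 1 / (p * (n + 1))
    second_ok = mean_inverse_positive(n, p) <= 2 / (p * (n + 1))
    checks.append((n, p, first_ok, second_ok))
```

For the first expectation, the classical closed form $(1 - (1-p)^{n+1}) / ((n+1)p)$ makes the inequality immediate.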



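For all distributions of $(\mathbf{X}, Y)$, using the elementary inequality 
$(a + b + c)^2 \le 3(a^2 + b^2 + c^2)$, note that 

$$
\begin{aligned}
\mathbb{E}|T_n(\mathbf{r}_k(\mathbf{X})) - T(\mathbf{r}_k(\mathbf{X}))|^2
&= \mathbb{E}\left|\sum_{i=1}^{\ell} W_{n,i}(\mathbf{X})\big(Y_i - T(\mathbf{r}_k(\mathbf{X}_i)) + T(\mathbf{r}_k(\mathbf{X}_i)) - T(\mathbf{r}_k(\mathbf{X})) + T(\mathbf{r}_k(\mathbf{X}))\big) - T(\mathbf{r}_k(\mathbf{X}))\right|^2 \\
&\le 3\,\mathbb{E}\left|\sum_{i=1}^{\ell} W_{n,i}(\mathbf{X})\big(T(\mathbf{r}_k(\mathbf{X}_i)) - T(\mathbf{r}_k(\mathbf{X}))\big)\right|^2 \qquad (4.3) \\
&\quad + 3\,\mathbb{E}\left|\sum_{i=1}^{\ell} W_{n,i}(\mathbf{X})\big(Y_i - T(\mathbf{r}_k(\mathbf{X}_i))\big)\right|^2 \qquad (4.4) \\
&\quad + 3\,\mathbb{E}\left|\Big(\sum_{i=1}^{\ell} W_{n,i}(\mathbf{X}) - 1\Big)T(\mathbf{r}_k(\mathbf{X}))\right|^2. \qquad (4.5)
\end{aligned}
$$

Consequently, to prove the proposition, it suffices to establish that (4.3), 
(4.4) and (4.5) tend to 0 as $\ell$ tends to infinity. This is done, respectively, in 
Proposition 4.1, Proposition 4.2 and Proposition 4.3 below. 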
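
The weights $W_{n,i}$ appearing in (4.3) to (4.5) take the form used throughout these proofs: an observation receives positive weight only when all $M$ machines place it within $\varepsilon_\ell$ of the query point. A minimal Python sketch, assuming the machine predictions have already been computed (the function names, data layout and numbers below are hypothetical and are not the interface of the COBRA package):

```python
def cobra_weights(train_preds, query_preds, eps):
    """Weights W_{n,i}(x), with the convention 0/0 = 0.

    train_preds: list of length ell; entry i holds the M machine
                 predictions (r_{k,1}(X_i), ..., r_{k,M}(X_i)).
    query_preds: the M machine predictions at the query point x.
    eps:         the proximity threshold eps_ell.
    """
    # An observation is retained when every machine agrees within eps.
    agree = [all(abs(q - t) <= eps for q, t in zip(query_preds, preds))
             for preds in train_preds]
    count = sum(agree)
    if count == 0:
        return [0.0] * len(train_preds)
    return [1.0 / count if a else 0.0 for a in agree]

def collective_estimate(train_preds, train_y, query_preds, eps):
    """T_n(r_k(x)): weighted average of the retained responses."""
    weights = cobra_weights(train_preds, query_preds, eps)
    return sum(w * y for w, y in zip(weights, train_y))

# Toy usage with M = 2 machines and ell = 3 observations.
preds = [(1.0, 1.1), (1.05, 1.0), (3.0, 2.9)]
responses = [1.2, 0.8, 3.1]
weights = cobra_weights(preds, (1.02, 1.05), 0.1)
estimate = collective_estimate(preds, responses, (1.02, 1.05), 0.1)
```

By construction the weights sum to either 0 or 1, which is why the term (4.5) only involves the event that no observation is retained.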

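Proposition 4.1. Under the assumptions of Proposition 2.1, 

$$\lim_{\ell \to \infty} \mathbb{E}\left|\sum_{i=1}^{\ell} W_{n,i}(\mathbf{X})\big(T(\mathbf{r}_k(\mathbf{X}_i)) - T(\mathbf{r}_k(\mathbf{X}))\big)\right|^2 = 0.$$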
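Proof of Proposition 4.1. By the Cauchy-Schwarz inequality, 

$$
\begin{aligned}
\mathbb{E}\left|\sum_{i=1}^{\ell} W_{n,i}(\mathbf{X})\big(T(\mathbf{r}_k(\mathbf{X}_i)) - T(\mathbf{r}_k(\mathbf{X}))\big)\right|^2
&\le \mathbb{E}\left|\sum_{i=1}^{\ell} \sqrt{W_{n,i}(\mathbf{X})}\sqrt{W_{n,i}(\mathbf{X})}\,\big|T(\mathbf{r}_k(\mathbf{X}_i)) - T(\mathbf{r}_k(\mathbf{X}))\big|\right|^2 \\
&\le \mathbb{E}\left[\sum_{i=1}^{\ell} W_{n,i}(\mathbf{X})\big|T(\mathbf{r}_k(\mathbf{X}_i)) - T(\mathbf{r}_k(\mathbf{X}))\big|^2\right] \\
&:= A_n.
\end{aligned}
$$

The function $T$ is such that $\mathbb{E}[T^2(\mathbf{r}_k(\mathbf{X}))] < \infty$. Therefore, it can be 
approximated in an $L^2$ sense by a continuous function with compact support, say 
$\tilde{T}$. This result may be found in many references, amongst them Györfi et al. 
(2002, Theorem A.1). More precisely, for any $\eta > 0$, there exists a function 
$\tilde{T}$ such that 

$$\mathbb{E}\big|T(\mathbf{r}_k(\mathbf{X})) - \tilde{T}(\mathbf{r}_k(\mathbf{X}))\big|^2 < \eta.$$

Consequently, we obtain 

$$
\begin{aligned}
A_n &= \mathbb{E}\left[\sum_{i=1}^{\ell} W_{n,i}(\mathbf{X})\big|T(\mathbf{r}_k(\mathbf{X}_i)) - T(\mathbf{r}_k(\mathbf{X}))\big|^2\right] \\
&\le 3\,\mathbb{E}\left[\sum_{i=1}^{\ell} W_{n,i}(\mathbf{X})\big|T(\mathbf{r}_k(\mathbf{X}_i)) - \tilde{T}(\mathbf{r}_k(\mathbf{X}_i))\big|^2\right] \\
&\quad + 3\,\mathbb{E}\left[\sum_{i=1}^{\ell} W_{n,i}(\mathbf{X})\big|\tilde{T}(\mathbf{r}_k(\mathbf{X}_i)) - \tilde{T}(\mathbf{r}_k(\mathbf{X}))\big|^2\right] \\
&\quad + 3\,\mathbb{E}\left[\sum_{i=1}^{\ell} W_{n,i}(\mathbf{X})\big|\tilde{T}(\mathbf{r}_k(\mathbf{X})) - T(\mathbf{r}_k(\mathbf{X}))\big|^2\right] \\
&:= 3A_{n1} + 3A_{n2} + 3A_{n3}.
\end{aligned}
$$

Computation of $A_{n3}$. Thanks to the approximation of $T$ by $\tilde{T}$, 

$$A_{n3} = \mathbb{E}\left[\sum_{i=1}^{\ell} W_{n,i}(\mathbf{X})\big|\tilde{T}(\mathbf{r}_k(\mathbf{X})) - T(\mathbf{r}_k(\mathbf{X}))\big|^2\right] \le \mathbb{E}\big|\tilde{T}(\mathbf{r}_k(\mathbf{X})) - T(\mathbf{r}_k(\mathbf{X}))\big|^2 < \eta.$$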
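Computation of $A_{n1}$. Denote by $\mu$ the distribution of $\mathbf{X}$. Then, 

$$
\begin{aligned}
A_{n1} &= \mathbb{E}\left[\sum_{i=1}^{\ell} W_{n,i}(\mathbf{X})\big|\tilde{T}(\mathbf{r}_k(\mathbf{X}_i)) - T(\mathbf{r}_k(\mathbf{X}_i))\big|^2\right] \\
&= \ell\,\mathbb{E}\left[\frac{\mathbf{1}_{\bigcap_{m=1}^{M}\{|r_{k,m}(\mathbf{X}) - r_{k,m}(\mathbf{X}_1)| \le \varepsilon_\ell\}}}{\sum_{j=1}^{\ell}\mathbf{1}_{\bigcap_{m=1}^{M}\{|r_{k,m}(\mathbf{X}) - r_{k,m}(\mathbf{X}_j)| \le \varepsilon_\ell\}}}\,\big|\tilde{T}(\mathbf{r}_k(\mathbf{X}_1)) - T(\mathbf{r}_k(\mathbf{X}_1))\big|^2\right] \\
&= \ell \int_{\mathbb{R}^d} \big|\tilde{T}(\mathbf{r}_k(\mathbf{u})) - T(\mathbf{r}_k(\mathbf{u}))\big|^2 \\
&\quad \times \mathbb{E}\left[\int_{\mathbb{R}^d} \frac{\mathbf{1}_{\bigcap_{m=1}^{M}\{|r_{k,m}(\mathbf{x}) - r_{k,m}(\mathbf{u})| \le \varepsilon_\ell\}}}{\mathbf{1}_{\bigcap_{m=1}^{M}\{|r_{k,m}(\mathbf{x}) - r_{k,m}(\mathbf{u})| \le \varepsilon_\ell\}} + \sum_{j=2}^{\ell}\mathbf{1}_{\bigcap_{m=1}^{M}\{|r_{k,m}(\mathbf{x}) - r_{k,m}(\mathbf{X}_j)| \le \varepsilon_\ell\}}}\,\mathrm{d}\mu(\mathbf{x})\right]\mathrm{d}\mu(\mathbf{u}).
\end{aligned}
$$

Let us prove that 

$$A'_{n1} = \mathbb{E}\left[\int_{\mathbb{R}^d} \frac{\mathbf{1}_{\bigcap_{m=1}^{M}\{|r_{k,m}(\mathbf{x}) - r_{k,m}(\mathbf{u})| \le \varepsilon_\ell\}}}{\mathbf{1}_{\bigcap_{m=1}^{M}\{|r_{k,m}(\mathbf{x}) - r_{k,m}(\mathbf{u})| \le \varepsilon_\ell\}} + \sum_{j=2}^{\ell}\mathbf{1}_{\bigcap_{m=1}^{M}\{|r_{k,m}(\mathbf{x}) - r_{k,m}(\mathbf{X}_j)| \le \varepsilon_\ell\}}}\,\mathrm{d}\mu(\mathbf{x})\right] \le \frac{2^M}{\ell}.$$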

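To this aim, observe that 

$$
\begin{aligned}
A'_{n1} &= \mathbb{E}\left[\int_{\mathbb{R}^d} \frac{\mathbf{1}_{\{\mathbf{x} \in \bigcap_{m=1}^{M} r_{k,m}^{-1}([r_{k,m}(\mathbf{u}) - \varepsilon_\ell,\, r_{k,m}(\mathbf{u}) + \varepsilon_\ell])\}}}{1 + \sum_{j=2}^{\ell}\mathbf{1}_{\{\mathbf{X}_j \in \bigcap_{m=1}^{M} r_{k,m}^{-1}([r_{k,m}(\mathbf{x}) - \varepsilon_\ell,\, r_{k,m}(\mathbf{x}) + \varepsilon_\ell])\}}}\,\mathrm{d}\mu(\mathbf{x})\right] \\
&= \mathbb{E}\left[\int_{\mathbb{R}^d} \frac{\mathbf{1}_{\{\mathbf{x} \in \bigcup_{(a_1,\dots,a_M) \in \{1,2\}^M} r_{k,1}^{-1}(I_{n,1}^{a_1}(\mathbf{u})) \cap \dots \cap r_{k,M}^{-1}(I_{n,M}^{a_M}(\mathbf{u}))\}}}{1 + \sum_{j=2}^{\ell}\mathbf{1}_{\{\mathbf{X}_j \in \bigcap_{m=1}^{M} r_{k,m}^{-1}([r_{k,m}(\mathbf{x}) - \varepsilon_\ell,\, r_{k,m}(\mathbf{x}) + \varepsilon_\ell])\}}}\,\mathrm{d}\mu(\mathbf{x})\right] \\
&\le \sum_{p=1}^{2^M} \mathbb{E}\left[\int_{\mathbb{R}^d} \frac{\mathbf{1}_{\{\mathbf{x} \in R_n^p(\mathbf{u})\}}}{1 + \sum_{j=2}^{\ell}\mathbf{1}_{\{\mathbf{X}_j \in \bigcap_{m=1}^{M} r_{k,m}^{-1}([r_{k,m}(\mathbf{x}) - \varepsilon_\ell,\, r_{k,m}(\mathbf{x}) + \varepsilon_\ell])\}}}\,\mathrm{d}\mu(\mathbf{x})\right].
\end{aligned}
$$

Here, $I_{n,m}^{1}(\mathbf{u}) = [r_{k,m}(\mathbf{u}) - \varepsilon_\ell,\, r_{k,m}(\mathbf{u})]$, $I_{n,m}^{2}(\mathbf{u}) = [r_{k,m}(\mathbf{u}),\, r_{k,m}(\mathbf{u}) + \varepsilon_\ell]$, and 
$R_n^p(\mathbf{u})$ is the $p$-th set of the form $r_{k,1}^{-1}(I_{n,1}^{a_1}(\mathbf{u})) \cap \dots \cap r_{k,M}^{-1}(I_{n,M}^{a_M}(\mathbf{u}))$, assuming 
that they have been ordered using the lexicographic order of $(a_1, \dots, a_M)$. 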

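Next, note that 

$$\mathbf{x} \in R_n^p(\mathbf{u}) \;\Longrightarrow\; R_n^p(\mathbf{u}) \subset \bigcap_{m=1}^{M} r_{k,m}^{-1}\big([r_{k,m}(\mathbf{x}) - \varepsilon_\ell,\, r_{k,m}(\mathbf{x}) + \varepsilon_\ell]\big).$$

To see this, just observe that, for all $m = 1, \dots, M$, if $r_{k,m}(\mathbf{z}) \in [r_{k,m}(\mathbf{u}) - \varepsilon_\ell,\, r_{k,m}(\mathbf{u})]$, 
i.e., $r_{k,m}(\mathbf{u}) - \varepsilon_\ell \le r_{k,m}(\mathbf{z}) \le r_{k,m}(\mathbf{u})$, then, as $r_{k,m}(\mathbf{u}) - \varepsilon_\ell \le r_{k,m}(\mathbf{x}) \le r_{k,m}(\mathbf{u})$, 
one has $r_{k,m}(\mathbf{x}) - \varepsilon_\ell \le r_{k,m}(\mathbf{z}) \le r_{k,m}(\mathbf{x}) + \varepsilon_\ell$. Similarly, 
if $r_{k,m}(\mathbf{u}) \le r_{k,m}(\mathbf{z}) \le r_{k,m}(\mathbf{u}) + \varepsilon_\ell$, then $r_{k,m}(\mathbf{u}) \le r_{k,m}(\mathbf{x}) \le r_{k,m}(\mathbf{u}) + \varepsilon_\ell$ 
implies $r_{k,m}(\mathbf{x}) - \varepsilon_\ell \le r_{k,m}(\mathbf{z}) \le r_{k,m}(\mathbf{x}) + \varepsilon_\ell$. Consequently, 

$$
\begin{aligned}
A'_{n1} &\le \sum_{p=1}^{2^M} \mathbb{E}\left[\int_{\mathbb{R}^d} \frac{\mathbf{1}_{\{\mathbf{x} \in R_n^p(\mathbf{u})\}}}{1 + \sum_{j=2}^{\ell}\mathbf{1}_{\{\mathbf{X}_j \in R_n^p(\mathbf{u})\}}}\,\mathrm{d}\mu(\mathbf{x})\right] \\
&= \sum_{p=1}^{2^M} \mathbb{E}\left[\frac{\mu(R_n^p(\mathbf{u}))}{1 + \sum_{j=2}^{\ell}\mathbf{1}_{\{\mathbf{X}_j \in R_n^p(\mathbf{u})\}}}\right] \\
&\le \frac{2^M}{\ell}
\end{aligned}
$$

(by the first statement of Lemma 4.1). Thus, returning to $A_{n1}$, we obtain 

$$A_{n1} \le 2^M\,\mathbb{E}\big|\tilde{T}(\mathbf{r}_k(\mathbf{X})) - T(\mathbf{r}_k(\mathbf{X}))\big|^2 < 2^M \eta.$$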
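
The covering step used for $A_{n1}$ can be checked by brute force in the space of prediction vectors, where each of the $2^M$ sets corresponds to a box of side $\varepsilon_\ell$ attached to $\mathbf{r}_k(\mathbf{u})$. The sketch below is an illustration under stated assumptions: $M = 2$, $\varepsilon_\ell = 1$, $\mathbf{r}_k(\mathbf{u}) = (0, 0)$ and a finite grid of candidate prediction vectors are arbitrary choices, and all helper names are hypothetical.

```python
from fractions import Fraction
from itertools import product

def cell(pred_u, signs, eps):
    """Box of prediction vectors behind one of the 2^M sets.

    signs[m] == 1 selects [r_m(u) - eps, r_m(u)];
    signs[m] == 2 selects [r_m(u), r_m(u) + eps].
    """
    return [(c - eps, c) if s == 1 else (c, c + eps)
            for c, s in zip(pred_u, signs)]

def in_box(point, box):
    return all(lo <= v <= hi for v, (lo, hi) in zip(point, box))

def within_eps(a, b, eps):
    """True when the two prediction vectors differ by at most eps in every coordinate."""
    return all(abs(x - y) <= eps for x, y in zip(a, b))

M, eps = 2, Fraction(1)
pred_u = (Fraction(0), Fraction(0))
grid = [Fraction(i, 4) for i in range(-8, 9)]      # values in [-2, 2]
points = list(product(grid, repeat=M))
cells = [cell(pred_u, signs, eps) for signs in product((1, 2), repeat=M)]

# (i) The 2^M cells cover the eps-neighbourhood of r_k(u).
ball = [p for p in points if within_eps(p, pred_u, eps)]
covered = all(any(in_box(p, c) for c in cells) for p in ball)

# (ii) Two prediction vectors in the same cell are within eps of each other.
same_cell_close = all(within_eps(x, z, eps)
                      for c in cells
                      for x in points if in_box(x, c)
                      for z in points if in_box(z, c))
```

Property (ii) is the implication stated above: once $\mathbf{x}$ lies in a cell, the whole cell is contained in the $\varepsilon_\ell$-neighbourhood of $\mathbf{x}$, so the denominator can only decrease when the neighbourhood is replaced by the cell.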



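Computation of $A_{n2}$. For any $\delta > 0$, write 

$$
\begin{aligned}
A_{n2} &= \mathbb{E}\left[\sum_{i=1}^{\ell} W_{n,i}(\mathbf{X})\big|\tilde{T}(\mathbf{r}_k(\mathbf{X}_i)) - \tilde{T}(\mathbf{r}_k(\mathbf{X}))\big|^2\right] \\
&= \mathbb{E}\left[\sum_{i=1}^{\ell} W_{n,i}(\mathbf{X})\big|\tilde{T}(\mathbf{r}_k(\mathbf{X}_i)) - \tilde{T}(\mathbf{r}_k(\mathbf{X}))\big|^2\,\mathbf{1}_{\bigcup_{m=1}^{M}\{|r_{k,m}(\mathbf{X}) - r_{k,m}(\mathbf{X}_i)| > \delta\}}\right] \\
&\quad + \mathbb{E}\left[\sum_{i=1}^{\ell} W_{n,i}(\mathbf{X})\big|\tilde{T}(\mathbf{r}_k(\mathbf{X}_i)) - \tilde{T}(\mathbf{r}_k(\mathbf{X}))\big|^2\,\mathbf{1}_{\bigcap_{m=1}^{M}\{|r_{k,m}(\mathbf{X}) - r_{k,m}(\mathbf{X}_i)| \le \delta\}}\right] \\
&\le 4 \sup_{\mathbf{u} \in \mathbb{R}^d} \big|\tilde{T}(\mathbf{r}_k(\mathbf{u}))\big|^2\,\mathbb{E}\left[\sum_{i=1}^{\ell} W_{n,i}(\mathbf{X})\,\mathbf{1}_{\bigcup_{m=1}^{M}\{|r_{k,m}(\mathbf{X}) - r_{k,m}(\mathbf{X}_i)| > \delta\}}\right] \qquad (4.6) \\
&\quad + \sup_{\mathbf{u}, \mathbf{v} \in \mathbb{R}^d,\ \bigcap_{m=1}^{M}\{|r_{k,m}(\mathbf{u}) - r_{k,m}(\mathbf{v})| \le \delta\}} \big|\tilde{T}(\mathbf{r}_k(\mathbf{v})) - \tilde{T}(\mathbf{r}_k(\mathbf{u}))\big|^2. \qquad (4.7)
\end{aligned}
$$

With respect to the term (4.6), if $\delta > \varepsilon_\ell$, then 

$$\sum_{i=1}^{\ell} W_{n,i}(\mathbf{X})\,\mathbf{1}_{\bigcup_{m=1}^{M}\{|r_{k,m}(\mathbf{X}) - r_{k,m}(\mathbf{X}_i)| > \delta\}} = \sum_{i=1}^{\ell} \frac{\mathbf{1}_{\bigcap_{m=1}^{M}\{|r_{k,m}(\mathbf{X}) - r_{k,m}(\mathbf{X}_i)| \le \varepsilon_\ell\}}\,\mathbf{1}_{\bigcup_{m=1}^{M}\{|r_{k,m}(\mathbf{X}) - r_{k,m}(\mathbf{X}_i)| > \delta\}}}{\sum_{j=1}^{\ell}\mathbf{1}_{\bigcap_{m=1}^{M}\{|r_{k,m}(\mathbf{X}) - r_{k,m}(\mathbf{X}_j)| \le \varepsilon_\ell\}}} = 0.$$

It follows that, for all $\delta > 0$, this term converges to 0 as $\ell$ tends to infinity. On 
the other hand, letting $\delta \to 0$, we see that the term (4.7) tends to 0 as well, by 
uniform continuity of $\tilde{T}$. Hence, $A_{n2}$ tends to 0 as $\ell$ tends to infinity. Letting 
finally $\eta$ go to 0, we conclude that $A_n$ vanishes as $\ell$ tends to infinity. □ 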

Proposition 4.2. Under the assumptions of Proposition 2.1, 

$$\lim_{\ell \to \infty} \mathbb{E}\left|\sum_{i=1}^{\ell} W_{n,i}(\mathbf{X})\big(Y_i - T(\mathbf{r}_k(\mathbf{X}_i))\big)\right|^2 = 0.$$



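Proof of Proposition 4.2. 

$$
\begin{aligned}
\mathbb{E}\left|\sum_{i=1}^{\ell} W_{n,i}(\mathbf{X})\big(Y_i - T(\mathbf{r}_k(\mathbf{X}_i))\big)\right|^2
&= \sum_{i=1}^{\ell}\sum_{j=1}^{\ell} \mathbb{E}\big[W_{n,i}(\mathbf{X})W_{n,j}(\mathbf{X})\big(Y_i - T(\mathbf{r}_k(\mathbf{X}_i))\big)\big(Y_j - T(\mathbf{r}_k(\mathbf{X}_j))\big)\big] \\
&= \mathbb{E}\left[\sum_{i=1}^{\ell} W_{n,i}^2(\mathbf{X})\big|Y_i - T(\mathbf{r}_k(\mathbf{X}_i))\big|^2\right] \\
&= \mathbb{E}\left[\sum_{i=1}^{\ell} W_{n,i}^2(\mathbf{X})\,\sigma^2(\mathbf{r}_k(\mathbf{X}_i))\right],
\end{aligned}
$$

where 

$$\sigma^2(\mathbf{r}_k(\mathbf{x})) = \mathbb{E}\big[|Y - T(\mathbf{r}_k(\mathbf{X}))|^2 \,\big|\, \mathbf{r}_k(\mathbf{x})\big].$$

For any $\eta > 0$, using again Györfi et al. (2002, Theorem A.1), $\sigma^2$ can be 
approximated in an $L^1$ sense by a continuous function with compact support 
$\tilde{\sigma}^2$, i.e., 

$$\mathbb{E}\big|\tilde{\sigma}^2(\mathbf{r}_k(\mathbf{X})) - \sigma^2(\mathbf{r}_k(\mathbf{X}))\big| < \eta.$$

Thus 

$$
\begin{aligned}
\mathbb{E}\left[\sum_{i=1}^{\ell} W_{n,i}^2(\mathbf{X})\,\sigma^2(\mathbf{r}_k(\mathbf{X}_i))\right]
&\le \mathbb{E}\left[\sum_{i=1}^{\ell} W_{n,i}^2(\mathbf{X})\,\tilde{\sigma}^2(\mathbf{r}_k(\mathbf{X}_i))\right] + \mathbb{E}\left[\sum_{i=1}^{\ell} W_{n,i}^2(\mathbf{X})\big|\sigma^2(\mathbf{r}_k(\mathbf{X}_i)) - \tilde{\sigma}^2(\mathbf{r}_k(\mathbf{X}_i))\big|\right] \\
&\le \sup_{\mathbf{u} \in \mathbb{R}^d} \big|\tilde{\sigma}^2(\mathbf{r}_k(\mathbf{u}))\big|\,\mathbb{E}\left[\sum_{i=1}^{\ell} W_{n,i}^2(\mathbf{X})\right] + \mathbb{E}\left[\sum_{i=1}^{\ell} W_{n,i}(\mathbf{X})\big|\tilde{\sigma}^2(\mathbf{r}_k(\mathbf{X}_i)) - \sigma^2(\mathbf{r}_k(\mathbf{X}_i))\big|\right].
\end{aligned}
$$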
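With the same argument as for $A_{n1}$, we obtain 

$$\mathbb{E}\left[\sum_{i=1}^{\ell} W_{n,i}(\mathbf{X})\big|\tilde{\sigma}^2(\mathbf{r}_k(\mathbf{X}_i)) - \sigma^2(\mathbf{r}_k(\mathbf{X}_i))\big|\right] \le 2^M \eta.$$

Therefore, it remains to prove that $\mathbb{E}\big[\sum_{i=1}^{\ell} W_{n,i}^2(\mathbf{X})\big] \to 0$ as $\ell \to \infty$. To this 
aim, fix $\delta > 0$, and note that 

$$
\begin{aligned}
\sum_{i=1}^{\ell} W_{n,i}^2(\mathbf{X})
&= \frac{\sum_{i=1}^{\ell}\mathbf{1}_{\bigcap_{m=1}^{M}\{|r_{k,m}(\mathbf{X}) - r_{k,m}(\mathbf{X}_i)| \le \varepsilon_\ell\}}}{\Big(\sum_{j=1}^{\ell}\mathbf{1}_{\bigcap_{m=1}^{M}\{|r_{k,m}(\mathbf{X}) - r_{k,m}(\mathbf{X}_j)| \le \varepsilon_\ell\}}\Big)^2} \\
&\le \min\left\{\delta,\ \frac{1}{\sum_{i=1}^{\ell}\mathbf{1}_{\bigcap_{m=1}^{M}\{|r_{k,m}(\mathbf{X}) - r_{k,m}(\mathbf{X}_i)| \le \varepsilon_\ell\}}}\right\} \\
&\le \delta + \frac{\mathbf{1}_{\big\{\sum_{i=1}^{\ell}\mathbf{1}_{\bigcap_{m=1}^{M}\{|r_{k,m}(\mathbf{X}) - r_{k,m}(\mathbf{X}_i)| \le \varepsilon_\ell\}} > 0\big\}}}{\sum_{i=1}^{\ell}\mathbf{1}_{\bigcap_{m=1}^{M}\{|r_{k,m}(\mathbf{X}) - r_{k,m}(\mathbf{X}_i)| \le \varepsilon_\ell\}}}.
\end{aligned}
$$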

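To complete the proof, we have to establish that the expectation of the right-hand 
term tends to 0. Denoting by $I$ an arbitrary interval on the real line, 
we have 

$$
\begin{aligned}
&\mathbb{E}\left[\frac{\mathbf{1}_{\big\{\sum_{i=1}^{\ell}\mathbf{1}_{\{\mathbf{X}_i \in \bigcap_{m=1}^{M} r_{k,m}^{-1}([r_{k,m}(\mathbf{X}) - \varepsilon_\ell,\, r_{k,m}(\mathbf{X}) + \varepsilon_\ell])\}} > 0\big\}}}{\sum_{i=1}^{\ell}\mathbf{1}_{\{\mathbf{X}_i \in \bigcap_{m=1}^{M} r_{k,m}^{-1}([r_{k,m}(\mathbf{X}) - \varepsilon_\ell,\, r_{k,m}(\mathbf{X}) + \varepsilon_\ell])\}}}\right] \\
&\quad \le \mathbb{E}\left[\frac{\mathbf{1}_{\big\{\sum_{i=1}^{\ell}\mathbf{1}_{\{\mathbf{X}_i \in \bigcap_{m=1}^{M} r_{k,m}^{-1}([r_{k,m}(\mathbf{X}) - \varepsilon_\ell,\, r_{k,m}(\mathbf{X}) + \varepsilon_\ell])\}} > 0\big\}}\,\mathbf{1}_{\{\mathbf{X} \in \bigcap_{m=1}^{M} r_{k,m}^{-1}(I)\}}}{\sum_{i=1}^{\ell}\mathbf{1}_{\{\mathbf{X}_i \in \bigcap_{m=1}^{M} r_{k,m}^{-1}([r_{k,m}(\mathbf{X}) - \varepsilon_\ell,\, r_{k,m}(\mathbf{X}) + \varepsilon_\ell])\}}}\right] + \mu\Big(\bigcup_{m=1}^{M} r_{k,m}^{-1}(I^c)\Big) \\
&\quad = \mathbb{E}\left[\mathbf{1}_{\{\mathbf{X} \in \bigcap_{m=1}^{M} r_{k,m}^{-1}(I)\}}\,\mathbb{E}\left[\frac{\mathbf{1}_{\big\{\sum_{i=1}^{\ell}\mathbf{1}_{\{\mathbf{X}_i \in \bigcap_{m=1}^{M} r_{k,m}^{-1}([r_{k,m}(\mathbf{X}) - \varepsilon_\ell,\, r_{k,m}(\mathbf{X}) + \varepsilon_\ell])\}} > 0\big\}}}{\sum_{i=1}^{\ell}\mathbf{1}_{\{\mathbf{X}_i \in \bigcap_{m=1}^{M} r_{k,m}^{-1}([r_{k,m}(\mathbf{X}) - \varepsilon_\ell,\, r_{k,m}(\mathbf{X}) + \varepsilon_\ell])\}}} \,\Big|\, \mathcal{D}_k, \mathbf{X}\right]\right] + \mu\Big(\bigcup_{m=1}^{M} r_{k,m}^{-1}(I^c)\Big) \\
&\quad \le \frac{2}{\ell + 1}\,\mathbb{E}\left[\frac{\mathbf{1}_{\{\mathbf{X} \in \bigcap_{m=1}^{M} r_{k,m}^{-1}(I)\}}}{\mu\big(\bigcap_{m=1}^{M} r_{k,m}^{-1}([r_{k,m}(\mathbf{X}) - \varepsilon_\ell,\, r_{k,m}(\mathbf{X}) + \varepsilon_\ell])\big)}\right] + \mu\Big(\bigcup_{m=1}^{M} r_{k,m}^{-1}(I^c)\Big).
\end{aligned}
$$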
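The last inequality arises from the second statement of Lemma 4.1. By an 
appropriate choice of $I$, the second term on the right-hand side can be made 
as small as desired. Regarding the first term, there exists a finite number $N_\ell$ 
of points $z_1, \dots, z_{N_\ell}$ such that 

$$\bigcap_{m=1}^{M} r_{k,m}^{-1}(I) \subset \bigcup_{(j_1,\dots,j_M) \in \{1,\dots,N_\ell\}^M} r_{k,1}^{-1}(I_{n,1}(z_{j_1})) \cap \dots \cap r_{k,M}^{-1}(I_{n,M}(z_{j_M})),$$

where $I_{n,m}(z_j) = [z_j - \varepsilon_\ell/2,\, z_j + \varepsilon_\ell/2]$. Suppose, without loss of generality, 
that the sets 

$$r_{k,1}^{-1}(I_{n,1}(z_{j_1})) \cap \dots \cap r_{k,M}^{-1}(I_{n,M}(z_{j_M}))$$

are ordered, and denote by $R_n^p$ the $p$-th among the $N_\ell^M = (\lceil |I|/\varepsilon_\ell \rceil)^M$ sets. 
Here $|I|$ denotes the length of the interval $I$ and $\lceil x \rceil$ denotes the smallest 
integer greater than $x$. For all $p$, 

$$\mathbf{X} \in R_n^p \;\Longrightarrow\; R_n^p \subset \bigcap_{m=1}^{M} r_{k,m}^{-1}\big([r_{k,m}(\mathbf{X}) - \varepsilon_\ell,\, r_{k,m}(\mathbf{X}) + \varepsilon_\ell]\big).$$

Indeed, if $\mathbf{v} \in R_n^p$, then, for all $m = 1, \dots, M$, there exists $j \in \{1, \dots, N_\ell\}$ 
such that $r_{k,m}(\mathbf{v}) \in [z_j - \varepsilon_\ell/2,\, z_j + \varepsilon_\ell/2]$, that is $z_j - \varepsilon_\ell/2 \le r_{k,m}(\mathbf{v}) \le z_j + \varepsilon_\ell/2$. 
Since we also have $z_j - \varepsilon_\ell/2 \le r_{k,m}(\mathbf{X}) \le z_j + \varepsilon_\ell/2$, we obtain 
$r_{k,m}(\mathbf{X}) - \varepsilon_\ell \le r_{k,m}(\mathbf{v}) \le r_{k,m}(\mathbf{X}) + \varepsilon_\ell$. In conclusion, 

$$
\begin{aligned}
\mathbb{E}\left[\frac{\mathbf{1}_{\{\mathbf{X} \in \bigcap_{m=1}^{M} r_{k,m}^{-1}(I)\}}}{\mu\big(\bigcap_{m=1}^{M} r_{k,m}^{-1}([r_{k,m}(\mathbf{X}) - \varepsilon_\ell,\, r_{k,m}(\mathbf{X}) + \varepsilon_\ell])\big)}\right]
&\le \sum_{p=1}^{N_\ell^M} \mathbb{E}\left[\frac{\mathbf{1}_{\{\mathbf{X} \in R_n^p\}}}{\mu\big(\bigcap_{m=1}^{M} r_{k,m}^{-1}([r_{k,m}(\mathbf{X}) - \varepsilon_\ell,\, r_{k,m}(\mathbf{X}) + \varepsilon_\ell])\big)}\right] \\
&\le \sum_{p=1}^{N_\ell^M} \mathbb{E}\left[\frac{\mathbf{1}_{\{\mathbf{X} \in R_n^p\}}}{\mu(R_n^p)}\right].
\end{aligned}
$$

Each summand is at most 1, so this expectation is at most $N_\ell^M = \lceil |I|/\varepsilon_\ell \rceil^M$. 
Since Lemma 4.1 contributes a factor of order $1/\ell$, the result follows from the 
assumption $\lim_{\ell \to \infty} \ell\varepsilon_\ell^M = \infty$. □ 

Proposition 4.3. Under the assumptions of Proposition 2.1, 

$$\lim_{\ell \to \infty} \mathbb{E}\left|\Big(\sum_{i=1}^{\ell} W_{n,i}(\mathbf{X}) - 1\Big)T(\mathbf{r}_k(\mathbf{X}))\right|^2 = 0.$$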
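Proof of Proposition 4.3. Since $\big|\sum_{i=1}^{\ell} W_{n,i}(\mathbf{X}) - 1\big| \le 1$, one has 

$$\left|\Big(\sum_{i=1}^{\ell} W_{n,i}(\mathbf{X}) - 1\Big)T(\mathbf{r}_k(\mathbf{X}))\right|^2 \le T^2(\mathbf{r}_k(\mathbf{X})).$$

Consequently, by Lebesgue's dominated convergence theorem, to prove the 
proposition, it suffices to show that $\sum_{i=1}^{\ell} W_{n,i}(\mathbf{X})$ tends to 1 almost surely. Now, 

$$
\begin{aligned}
\mathbb{P}\Big(\sum_{i=1}^{\ell} W_{n,i}(\mathbf{X}) \ne 1\Big)
&= \mathbb{P}\Big(\sum_{i=1}^{\ell}\mathbf{1}_{\bigcap_{m=1}^{M}\{|r_{k,m}(\mathbf{X}) - r_{k,m}(\mathbf{X}_i)| \le \varepsilon_\ell\}} = 0\Big) \\
&= \mathbb{P}\Big(\sum_{i=1}^{\ell}\mathbf{1}_{\{\mathbf{X}_i \in \bigcap_{m=1}^{M} r_{k,m}^{-1}([r_{k,m}(\mathbf{X}) - \varepsilon_\ell,\, r_{k,m}(\mathbf{X}) + \varepsilon_\ell])\}} = 0\Big) \\
&= \int \mathbb{P}\Big(\forall i = 1, \dots, \ell,\ \mathbf{1}_{\{\mathbf{X}_i \in \bigcap_{m=1}^{M} r_{k,m}^{-1}([r_{k,m}(\mathbf{x}) - \varepsilon_\ell,\, r_{k,m}(\mathbf{x}) + \varepsilon_\ell])\}} = 0\Big)\,\mathrm{d}\mu(\mathbf{x}) \\
&= \int \Big[1 - \mu\Big(\bigcap_{m=1}^{M} r_{k,m}^{-1}([r_{k,m}(\mathbf{x}) - \varepsilon_\ell,\, r_{k,m}(\mathbf{x}) + \varepsilon_\ell])\Big)\Big]^{\ell}\,\mathrm{d}\mu(\mathbf{x}).
\end{aligned}
$$

Denote by $I$ an arbitrary interval. Then, 

$$
\begin{aligned}
\mathbb{P}\Big(\sum_{i=1}^{\ell} W_{n,i}(\mathbf{X}) \ne 1\Big)
&\le \int \exp\Big(-\ell\,\mu\Big(\bigcap_{m=1}^{M} r_{k,m}^{-1}([r_{k,m}(\mathbf{x}) - \varepsilon_\ell,\, r_{k,m}(\mathbf{x}) + \varepsilon_\ell])\Big)\Big)\,\mathbf{1}_{\{\mathbf{x} \in \bigcap_{m=1}^{M} r_{k,m}^{-1}(I)\}}\,\mathrm{d}\mu(\mathbf{x}) + \mu\Big(\bigcup_{m=1}^{M} r_{k,m}^{-1}(I^c)\Big) \\
&\le \max_{u} u e^{-u} \int \frac{\mathbf{1}_{\{\mathbf{x} \in \bigcap_{m=1}^{M} r_{k,m}^{-1}(I)\}}}{\ell\,\mu\big(\bigcap_{m=1}^{M} r_{k,m}^{-1}([r_{k,m}(\mathbf{x}) - \varepsilon_\ell,\, r_{k,m}(\mathbf{x}) + \varepsilon_\ell])\big)}\,\mathrm{d}\mu(\mathbf{x}) + \mu\Big(\bigcup_{m=1}^{M} r_{k,m}^{-1}(I^c)\Big).
\end{aligned}
$$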
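
The two elementary inequalities behind this chain are $(1 - p)^{\ell} \le e^{-\ell p}$ and $e^{-u} \le \max_{v} (v e^{-v}) / u = 1/(e u)$ with $u = \ell p$. A quick numerical sanity check, offered only as an illustration (the helper name and the tested values of $\ell$ and $p$ are arbitrary):

```python
from math import e, exp

def chain(ell, p):
    """Return ((1 - p)^ell, exp(-ell * p), 1 / (e * ell * p))."""
    return (1 - p) ** ell, exp(-ell * p), 1 / (e * ell * p)

def chain_holds(ell, p, tol=1e-15):
    """Check that the three quantities are in nondecreasing order."""
    a, b, c = chain(ell, p)
    return a <= b + tol and b <= c + tol

# Values are chosen so that ell * p is never exactly 1, where the second
# inequality is an equality and rounding could blur the comparison.
all_hold = all(chain_holds(ell, p)
               for ell in (1, 10, 100, 1000)
               for p in (0.002, 0.03, 0.2, 0.5))
```

In the proof, $p$ plays the role of the $\mu$-measure of the neighbourhood of $\mathbf{x}$, which is why a lower bound on that measure is all that remains to be controlled.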



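Using the same arguments as in the proof of Proposition 4.2, the probability 
$\mathbb{P}\big(\sum_{i=1}^{\ell} W_{n,i}(\mathbf{X}) \ne 1\big)$ is bounded by $\frac{e^{-1}}{\ell}\big\lceil |I|/\varepsilon_\ell \big\rceil^M$, up to the term 
$\mu\big(\bigcup_{m=1}^{M} r_{k,m}^{-1}(I^c)\big)$, which can be made as small as desired by the choice of $I$. 
This bound vanishes as $\ell$ tends to infinity since, by assumption, 
$\lim_{\ell \to \infty} \ell\varepsilon_\ell^M = \infty$. □ 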
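
To see how the condition $\ell\varepsilon_\ell^M \to \infty$ drives a bound of the form $e^{-1}\lceil |I|/\varepsilon_\ell \rceil^M / \ell$ to zero, one can evaluate it along a sequence. The sketch below assumes the illustrative choice $\varepsilon_\ell = \ell^{-1/(2M)}$, which is not taken from the text; it satisfies $\varepsilon_\ell \to 0$ and $\ell\varepsilon_\ell^M = \sqrt{\ell} \to \infty$. The function name and the interval length are likewise hypothetical.

```python
from math import ceil, exp

def final_bound(ell, M, interval_length=1.0):
    """e^{-1} * ceil(|I| / eps_ell)^M / ell for eps_ell = ell^(-1/(2M))."""
    eps = ell ** (-1.0 / (2 * M))
    return exp(-1) * ceil(interval_length / eps) ** M / ell

# With M = 2 machines the bound decays roughly like 1 / sqrt(ell).
sample_sizes = (10**2, 10**4, 10**6, 10**8)
bounds = [final_bound(ell, M=2) for ell in sample_sizes]
```

A threshold shrinking too fast, for instance $\varepsilon_\ell = 1/\ell$ with $M \ge 1$, would violate the assumption and the same expression would no longer tend to zero.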